As mentioned at the beginning of lecture, there are (at least) two basic learning setups. In supervised machine learning one observes a response \(Y\) along with \(p\) different features \(X=(X_1,X_2,\ldots,X_p)\), where we typically postulate the relationship \(Y = f(X)+\varepsilon\) with \(\varepsilon\) independent of \(X\) and mean zero. Here quality is usually assessed by the (test/out-of-sample) error, which compares predictions and realizations on a separate dataset. In unsupervised learning, we only observe the \(p\) features \(X_1,X_2,\ldots,X_p\), and we would like to learn about their relationship – without focusing on a supervising outcome. Of course, the difficulty is how to assess quality in this case – so different unsupervised learning techniques can look very different from one another, and which one to pick will depend on the nature of the problem.
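To make the contrast concrete, here is a minimal sketch (the choice of \(f\) and the noise level are our own toy assumptions): in the supervised setup we can score a fit by its out-of-sample error, while in the unsupervised setup no such yardstick exists.

```r
set.seed(1)
n <- 200
df <- data.frame(x = runif(n, 0, 2))
df$y <- sin(2 * df$x) + rnorm(n, sd = 0.3)   # Y = f(X) + eps

# supervised: fit on one half, assess test error on the other half
train <- sample(n, n / 2)
fit <- lm(y ~ poly(x, 3), data = df[train, ])
test.mse <- mean((df$y[-train] - predict(fit, df[-train, ]))^2)
test.mse   # the irreducible part is var(eps) = 0.3^2 = 0.09

# unsupervised: only x would be observed -- there is no outcome to
# score against, so "quality" needs a different notion altogether
```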

In this tutorial, we will take a closer look at clustering. There are a variety of other unsupervised techniques, including anomaly detection, self-organizing maps, association analysis, etc.

Clear the slate and install required packages

As before, let’s start fresh and install the required packages, if necessary:

rm(list=ls())
#install.packages("maps")
#install.packages("mapproj")
#install.packages("NbClust")

Clustering

Background

Clustering refers to techniques for finding subgroups in a given dataset. The typical approach to determine clusters \(C_1,\ldots,C_K\) is to minimize: \[ \sum_{k=1}^K W(C_k), \] where \(W\) is a measure of variation within a cluster. For instance, k-means clustering uses the squared Euclidean distance to measure variation: \[ W(C_k) = \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2. \] In practice, the minimization is carried out by a greedy algorithm built around the cluster centers (referred to as centroids). The number of clusters \(K\) must be chosen beforehand. An alternative approach is hierarchical clustering, where one starts with each observation as its own cluster and then successively fuses the clusters that are most similar (e.g., with regard to the distance between their centroids).
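The centroid-based algorithm works because of an algebraic identity: the average pairwise squared distance within a cluster equals twice the sum of squared distances to the cluster centroid. A quick numerical check on toy data (the variable names are ours):

```r
set.seed(1)
Ck <- matrix(rnorm(10 * 2), ncol = 2)   # one toy "cluster" of 10 points

# (1) W(C_k) as defined above: average pairwise squared distance
W.pairwise <- sum(as.matrix(dist(Ck))^2) / nrow(Ck)

# (2) twice the total squared distance to the centroid
centroid <- colMeans(Ck)
W.centroid <- 2 * sum(sweep(Ck, 2, centroid)^2)

all.equal(W.pairwise, W.centroid)   # TRUE
```

This identity is why minimizing \(W(C_k)\) can be done by repeatedly assigning points to the nearest centroid and recomputing the centroids.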

Simulated Example

Let’s consider a very basic simulated example – let’s simulate normal random variables with different means:

set.seed(2)
x = matrix(rnorm(50*2), ncol=2)
x[1:25,1] = x[1:25,1] + 3
x[1:25,2] = x[1:25,2] - 4

And let’s run k-means clustering, where we set nstart to 20, meaning that the algorithm is run from 20 random initial configurations. The best one is reported:

km.out = kmeans(x,2,nstart=20)
plot(x, col=(km.out$cluster+ 1), main="K-Means Clustering Results with K=2", xlab="", ylab="", pch=20, cex=2)

So the algorithm was able to identify how we set up the data!
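We can make that claim precise by cross-tabulating the recovered clusters against the true simulated groups (re-simulating the data above; note that cluster labels are arbitrary, so a relabeled match still counts as recovery):

```r
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4
truth <- rep(1:2, each = 25)   # the groups we built in

km <- kmeans(x, 2, nstart = 20)
tab <- table(truth, km$cluster)
tab
# agreement up to label switching:
max(sum(diag(tab)), tab[1, 2] + tab[2, 1])
```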

Case Study: County Health Rankings 2013

We analyze County Health Rankings in the US in 2013, based on a data set from the University of Wisconsin Population Health Institute. I took this from a class I taught with Jim Guszcza, and he put this together (it’s much better than what I had before :-).

Data Preparation

Let’s load the data:

raw <- read.csv("countyHealthRR.csv")
dim(raw)
[1] 3141   33
raw[1:3,]
  FIPS   State  County YPLL.Rate Perc.Fair.Poor.Health
1 1001 Alabama Autauga      8376                    23
2 1003 Alabama Baldwin      7770                    13
3 1005 Alabama Barbour      9458                    22
  Physically.Unhealthy.Days Mentally.Unhealthy.Days Perc.Low.Birth.Weight
1                       5.1                     4.3                   9.4
2                       3.3                     3.8                   8.8
3                       5.0                     4.1                  12.3
  Perc.Smokers Perc.Obese Perc.Physically.Inactive Perc.Excessive.Drinking
1           24         34                       32                      17
2           23         26                       25                      19
3           24         37                       34                      13
  MV.Mortality.Rate Chlamydia.Rate Teen.Birth.Rate Perc.Uninsured
1                24            363              43             14
2                20            306              48             19
3                26            645              74             19
  Pr.Care.Physician.Ratio Dentist.Ratio Prev.Hosp.Stay.Rate
1                    2731          3935                  71
2                    1347          2379                  56
3                    2282          3398                 109
  Perc.Diabetic.Screening Perc.Mammography Perc.High.School.Grad
1                      85             68.5                    80
2                      81             69.9                    74
3                      86             64.8                    61
  Perc.Some.College Perc.Unemployed Perc.Children.in.Poverty
1              54.4             8.0                       21
2              62.1             8.1                       21
3              42.1            11.4                       40
  Perc.No.Soc.Emo.Support Perc.Single.Parent.HH Violent.Crime.Rate
1                      24                    26                300
2                      19                    28                215
3                      18                    57                150
  Avg.Daily.Particulates Perc.pop.in.viol Rec.Facility.Rate Perc.Limited.Access
1                   13.3                0               7.3                  10
2                   11.8                8               8.7                   5
3                   12.1                8               3.7                  11
  Perc.Fast.Foods
1              52
2              35
3              56
library(psych)
Warning: package 'psych' was built under R version 3.5.2
describe(raw)[,1:5]
                          vars    n     mean       sd   median
FIPS                         1 3141 30408.40 15152.78 29179.00
State*                       2 3141    27.26    14.26    26.00
County*                      3 3141   927.12   517.87   925.00
YPLL.Rate                    4 2987  8060.13  2407.31  7727.00
Perc.Fair.Poor.Health        5 2743    16.63     5.69    16.00
Physically.Unhealthy.Days    6 2897     3.74     1.05     3.70
Mentally.Unhealthy.Days      7 2862     3.43     0.98     3.40
Perc.Low.Birth.Weight        8 2922     8.29     2.05     7.90
Perc.Smokers                 9 2536    20.52     6.09    20.00
Perc.Obese                  10 3141    30.39     4.17    31.00
Perc.Physically.Inactive    11 3141    27.90     5.16    28.00
Perc.Excessive.Drinking     12 2623    14.52     5.61    15.00
MV.Mortality.Rate           13 2773    22.06    10.50    20.00
Chlamydia.Rate              14 3135   309.33   270.86   232.00
Teen.Birth.Rate             15 2960    45.85    20.23    44.00
Perc.Uninsured              16 3140    18.57     5.61    18.00
Pr.Care.Physician.Ratio     17 2751  2486.47  1535.72  1987.00
Dentist.Ratio               18 2984  3245.48  1814.83  2753.50
Prev.Hosp.Stay.Rate         19 3011    78.75    31.27    73.00
Perc.Diabetic.Screening     20 3094    83.64     6.78    85.00
Perc.Mammography            21 3055    63.08     8.39    63.60
Perc.High.School.Grad       22 3107    82.63    10.03    84.00
Perc.Some.College           23 3141    54.17    11.99    54.20
Perc.Unemployed             24 3140     8.57     2.99     8.35
Perc.Children.in.Poverty    25 3140    24.74     9.02    24.00
Perc.No.Soc.Emo.Support     26 2471    19.39     5.44    19.00
Perc.Single.Parent.HH       27 3137    30.99    10.20    30.00
Violent.Crime.Rate          28 2966   270.89   226.59   211.00
Avg.Daily.Particulates      29 3107    11.08     1.81    11.20
Perc.pop.in.viol            30 3084     9.62    20.11     0.00
Rec.Facility.Rate           31 3141     7.49     7.59     6.70
Perc.Limited.Access         32 3141     8.37     8.22     6.00
Perc.Fast.Foods             33 2989    45.48    13.72    47.00

So it’s a large dataset; the first three columns are indices, while the remaining columns provide rather detailed information for each county. Let’s visualize the correlations:

library(corrplot)
corrplot 0.84 loaded
rho <- cor(raw[,-(1:3)], use="pairwise.complete.obs")
col3 <- colorRampPalette(c("red", "white", "blue"))
corrplot(rho, tl.cex=.7, order="hclust", col=col3(50))

corrplot(rho, order="hclust", method="shade", col=col3(50), tl.cex=.7)

So quite a bit of correlation, but nothing seems to be “replicated.” Unfortunately, we have a bunch of missing data. Let’s see how much we would lose if we simply omitted incomplete rows:

dat <- raw
dd <- describe(dat)[,1:5] 
dd
                          vars    n     mean       sd   median
FIPS                         1 3141 30408.40 15152.78 29179.00
State*                       2 3141    27.26    14.26    26.00
County*                      3 3141   927.12   517.87   925.00
YPLL.Rate                    4 2987  8060.13  2407.31  7727.00
Perc.Fair.Poor.Health        5 2743    16.63     5.69    16.00
Physically.Unhealthy.Days    6 2897     3.74     1.05     3.70
Mentally.Unhealthy.Days      7 2862     3.43     0.98     3.40
Perc.Low.Birth.Weight        8 2922     8.29     2.05     7.90
Perc.Smokers                 9 2536    20.52     6.09    20.00
Perc.Obese                  10 3141    30.39     4.17    31.00
Perc.Physically.Inactive    11 3141    27.90     5.16    28.00
Perc.Excessive.Drinking     12 2623    14.52     5.61    15.00
MV.Mortality.Rate           13 2773    22.06    10.50    20.00
Chlamydia.Rate              14 3135   309.33   270.86   232.00
Teen.Birth.Rate             15 2960    45.85    20.23    44.00
Perc.Uninsured              16 3140    18.57     5.61    18.00
Pr.Care.Physician.Ratio     17 2751  2486.47  1535.72  1987.00
Dentist.Ratio               18 2984  3245.48  1814.83  2753.50
Prev.Hosp.Stay.Rate         19 3011    78.75    31.27    73.00
Perc.Diabetic.Screening     20 3094    83.64     6.78    85.00
Perc.Mammography            21 3055    63.08     8.39    63.60
Perc.High.School.Grad       22 3107    82.63    10.03    84.00
Perc.Some.College           23 3141    54.17    11.99    54.20
Perc.Unemployed             24 3140     8.57     2.99     8.35
Perc.Children.in.Poverty    25 3140    24.74     9.02    24.00
Perc.No.Soc.Emo.Support     26 2471    19.39     5.44    19.00
Perc.Single.Parent.HH       27 3137    30.99    10.20    30.00
Violent.Crime.Rate          28 2966   270.89   226.59   211.00
Avg.Daily.Particulates      29 3107    11.08     1.81    11.20
Perc.pop.in.viol            30 3084     9.62    20.11     0.00
Rec.Facility.Rate           31 3141     7.49     7.59     6.70
Perc.Limited.Access         32 3141     8.37     8.22     6.00
Perc.Fast.Foods             33 2989    45.48    13.72    47.00
dim(dat)
[1] 3141   33
dim(na.omit(dat))
[1] 1740   33

So unfortunately we would be dropping a lot of data. Hence, instead, let’s drop the few variables with high missing-data rates, keeping only those with more than 2,800 observed values.

keep <- dd$var[dd$n > 2800]
names(dat)
 [1] "FIPS"                      "State"                    
 [3] "County"                    "YPLL.Rate"                
 [5] "Perc.Fair.Poor.Health"     "Physically.Unhealthy.Days"
 [7] "Mentally.Unhealthy.Days"   "Perc.Low.Birth.Weight"    
 [9] "Perc.Smokers"              "Perc.Obese"               
[11] "Perc.Physically.Inactive"  "Perc.Excessive.Drinking"  
[13] "MV.Mortality.Rate"         "Chlamydia.Rate"           
[15] "Teen.Birth.Rate"           "Perc.Uninsured"           
[17] "Pr.Care.Physician.Ratio"   "Dentist.Ratio"            
[19] "Prev.Hosp.Stay.Rate"       "Perc.Diabetic.Screening"  
[21] "Perc.Mammography"          "Perc.High.School.Grad"    
[23] "Perc.Some.College"         "Perc.Unemployed"          
[25] "Perc.Children.in.Poverty"  "Perc.No.Soc.Emo.Support"  
[27] "Perc.Single.Parent.HH"     "Violent.Crime.Rate"       
[29] "Avg.Daily.Particulates"    "Perc.pop.in.viol"         
[31] "Rec.Facility.Rate"         "Perc.Limited.Access"      
[33] "Perc.Fast.Foods"          
names(dat)[keep]
 [1] "FIPS"                      "State"                    
 [3] "County"                    "YPLL.Rate"                
 [5] "Physically.Unhealthy.Days" "Mentally.Unhealthy.Days"  
 [7] "Perc.Low.Birth.Weight"     "Perc.Obese"               
 [9] "Perc.Physically.Inactive"  "Chlamydia.Rate"           
[11] "Teen.Birth.Rate"           "Perc.Uninsured"           
[13] "Dentist.Ratio"             "Prev.Hosp.Stay.Rate"      
[15] "Perc.Diabetic.Screening"   "Perc.Mammography"         
[17] "Perc.High.School.Grad"     "Perc.Some.College"        
[19] "Perc.Unemployed"           "Perc.Children.in.Poverty" 
[21] "Perc.Single.Parent.HH"     "Violent.Crime.Rate"       
[23] "Avg.Daily.Particulates"    "Perc.pop.in.viol"         
[25] "Rec.Facility.Rate"         "Perc.Limited.Access"      
[27] "Perc.Fast.Foods"          
names(dat)[-keep]
[1] "Perc.Fair.Poor.Health"   "Perc.Smokers"           
[3] "Perc.Excessive.Drinking" "MV.Mortality.Rate"      
[5] "Pr.Care.Physician.Ratio" "Perc.No.Soc.Emo.Support"
dim(na.omit(dat[,keep]))
[1] 2387   27

Better. Let’s take a look at the new set:

dat <- dat[,keep]
dim(dat)
[1] 3141   27
describe(dat)[,1:5]
                          vars    n     mean       sd   median
FIPS                         1 3141 30408.40 15152.78 29179.00
State*                       2 3141    27.26    14.26    26.00
County*                      3 3141   927.12   517.87   925.00
YPLL.Rate                    4 2987  8060.13  2407.31  7727.00
Physically.Unhealthy.Days    5 2897     3.74     1.05     3.70
Mentally.Unhealthy.Days      6 2862     3.43     0.98     3.40
Perc.Low.Birth.Weight        7 2922     8.29     2.05     7.90
Perc.Obese                   8 3141    30.39     4.17    31.00
Perc.Physically.Inactive     9 3141    27.90     5.16    28.00
Chlamydia.Rate              10 3135   309.33   270.86   232.00
Teen.Birth.Rate             11 2960    45.85    20.23    44.00
Perc.Uninsured              12 3140    18.57     5.61    18.00
Dentist.Ratio               13 2984  3245.48  1814.83  2753.50
Prev.Hosp.Stay.Rate         14 3011    78.75    31.27    73.00
Perc.Diabetic.Screening     15 3094    83.64     6.78    85.00
Perc.Mammography            16 3055    63.08     8.39    63.60
Perc.High.School.Grad       17 3107    82.63    10.03    84.00
Perc.Some.College           18 3141    54.17    11.99    54.20
Perc.Unemployed             19 3140     8.57     2.99     8.35
Perc.Children.in.Poverty    20 3140    24.74     9.02    24.00
Perc.Single.Parent.HH       21 3137    30.99    10.20    30.00
Violent.Crime.Rate          22 2966   270.89   226.59   211.00
Avg.Daily.Particulates      23 3107    11.08     1.81    11.20
Perc.pop.in.viol            24 3084     9.62    20.11     0.00
Rec.Facility.Rate           25 3141     7.49     7.59     6.70
Perc.Limited.Access         26 3141     8.37     8.22     6.00
Perc.Fast.Foods             27 2989    45.48    13.72    47.00
dim(dat)
[1] 3141   27
dat <- na.omit(dat)
dim(dat)
[1] 2387   27

And let’s throw out IDs:

ID <- dat[,1:3]
summary(ID)
      FIPS                  State             County    
 Min.   : 1001   Texas         : 127   Washington:  29  
 1st Qu.:18164   Georgia       : 118   Franklin  :  22  
 Median :29077   Kentucky      : 100   Jefferson :  22  
 Mean   :30021   Missouri      :  95   Lincoln   :  19  
 3rd Qu.:42092   North Carolina:  91   Jackson   :  18  
 Max.   :56045   Iowa          :  87   Madison   :  16  
                 (Other)       :1769   (Other)   :2261  
dat <- dat[,4:ncol(dat)]

And let’s scale the data so as to make sure we are comparing apples to apples:

dat <- scale(dat)
dat <- as.data.frame(dat)
dat[1:2,]
    YPLL.Rate Physically.Unhealthy.Days Mentally.Unhealthy.Days
1  0.22195709                 1.2741861               0.8531928
2 -0.05159446                -0.5000776               0.3187999
  Perc.Low.Birth.Weight Perc.Obese Perc.Physically.Inactive Chlamydia.Rate
1             0.6386909  0.8490322                0.8371057     0.21310594
2             0.3279272 -1.0541393               -0.4860614    -0.02745043
  Teen.Birth.Rate Perc.Uninsured Dentist.Ratio Prev.Hosp.Stay.Rate
1     -0.06711232     -0.7356113     0.4220123          -0.2013252
2      0.19654323      0.2310834    -0.4604586          -0.7004909
  Perc.Diabetic.Screening Perc.Mammography Perc.High.School.Grad
1               0.1790685        0.6209114            -0.2315002
2              -0.4663623        0.8003722            -0.9040249
  Perc.Some.College Perc.Unemployed Perc.Children.in.Poverty
1      -0.007452166      -0.2813093               -0.3888458
2       0.664958589      -0.2452099               -0.3888458
  Perc.Single.Parent.HH Violent.Crime.Rate Avg.Daily.Particulates
1            -0.5712313         0.08219526               1.177996
2            -0.3510409        -0.30339135               0.335747
  Perc.pop.in.viol Rec.Facility.Rate Perc.Limited.Access Perc.Fast.Foods
1      -0.47427608        -0.1350508           0.5922044       0.5055126
2      -0.03336395         0.0844166          -0.3700773      -0.8828594
describe(dat)[,1:4]
                          vars    n mean sd
YPLL.Rate                    1 2387    0  1
Physically.Unhealthy.Days    2 2387    0  1
Mentally.Unhealthy.Days      3 2387    0  1
Perc.Low.Birth.Weight        4 2387    0  1
Perc.Obese                   5 2387    0  1
Perc.Physically.Inactive     6 2387    0  1
Chlamydia.Rate               7 2387    0  1
Teen.Birth.Rate              8 2387    0  1
Perc.Uninsured               9 2387    0  1
Dentist.Ratio               10 2387    0  1
Prev.Hosp.Stay.Rate         11 2387    0  1
Perc.Diabetic.Screening     12 2387    0  1
Perc.Mammography            13 2387    0  1
Perc.High.School.Grad       14 2387    0  1
Perc.Some.College           15 2387    0  1
Perc.Unemployed             16 2387    0  1
Perc.Children.in.Poverty    17 2387    0  1
Perc.Single.Parent.HH       18 2387    0  1
Violent.Crime.Rate          19 2387    0  1
Avg.Daily.Particulates      20 2387    0  1
Perc.pop.in.viol            21 2387    0  1
Rec.Facility.Rate           22 2387    0  1
Perc.Limited.Access         23 2387    0  1
Perc.Fast.Foods             24 2387    0  1

Exploratory data analysis

Let’s again look at the correlations:

corrplot(cor(dat), tl.cex=.7, order="hclust", col=col3(50))

As a heuristic to suggest the appropriate number of clusters, we evaluate how the within sum of squares varies by clusters:

ss <- function(x)  sum( ( x-mean(x) )^2 )
wss <- NULL
wss[1] <- sum( apply(dat,2,ss) )
for (k in 2:10) {
    temp <- kmeans(dat, k)
    wss[k] <- sum(temp$withinss)
}

barplot(wss, col="dodgerblue", names.arg=1:length(wss)
    , xlab="Number of Clusters (k)"
    , ylab="Total Within Sum of Squares")
abline(h=0)
title("Within Sum-of-Squares Analysis", col.main="navy")

Clustering

One way to run the clustering algorithm that allows for a number of “knobs” is NbClust from the library NbClust (e.g., try NbClust(dat, min.nc=3, max.nc=6, method="kmeans")), but it takes a long time to run. So let’s use the basic kmeans function again, where we set the number of clusters to five:

k <- 5
set.seed(652)
km <- kmeans(dat, k)
clust.km <- km$cluster

One way of illustrating hierarchical clustering is a so-called dendrogram, which displays how observations are successively fused into larger and larger clusters as one moves up the tree (try to zoom):

dd <- dist(dat, method="euclidean")
hc1 <- hclust(dd, method="average")    # alternative linkages, for comparison
hc1 <- hclust(dd, method="complete")
hc1 <- hclust(dd, method="ward.D")     # we keep Ward's method
plot(hc1, hang=-1)
rect.hclust(hc1, k=6, border="dodgerblue")
rect.hclust(hc1, k=5, border="blue")
rect.hclust(hc1, k=4, border="red")
rect.hclust(hc1, k=3, border="green")

hc1 <- hclust(dd, method="ward.D")
rect.hclust(hc1, k=5, border="dodgerblue")

clust.hc1 <- cutree(hc1,5)

To aid visualization, let’s relabel the clusters in increasing order of average mortality:

reord <- function(cluster){
    avg <- tapply(scored$YPLL.Rate, cluster, mean)
    ord <- order(avg)
    clus <- factor(cluster, levels=ord)
    levels(clus) <- 1:length(ord)
    return( as.numeric(as.character(clus)) )
}


scored <- dat
scored$clust.km <- reord(clust.km)
scored$clust.hc <- reord(clust.hc1)

Let’s check:

tapply(scored$YPLL.Rate, clust.km, mean)
           1            2            3            4            5 
-0.164393620  1.062612713  1.343340184 -0.007740001 -0.980619550 
tapply(scored$YPLL.Rate, reord(clust.km), mean)
           1            2            3            4            5 
-0.980619550 -0.164393620 -0.007740001  1.062612713  1.343340184 
table(clust.km, reord(clust.km))
        
clust.km   1   2   3   4   5
       1   0 658   0   0   0
       2   0   0   0 399   0
       3   0   0   0   0 234
       4   0   0 457   0   0
       5 639   0   0   0   0
tapply(scored$YPLL.Rate, clust.hc1, mean)
         1          2          3          4          5 
 0.6839265 -0.1262175  1.3095073 -1.0664764 -0.5250699 
tapply(scored$YPLL.Rate, reord(clust.hc1), mean)
         1          2          3          4          5 
-1.0664764 -0.5250699 -0.1262175  0.6839265  1.3095073 
table(clust.hc1, reord(clust.hc1))
         
clust.hc1   1   2   3   4   5
        1   0   0   0 767   0
        2   0   0 342   0   0
        3   0   0   0   0 204
        4 341   0   0   0   0
        5   0 733   0   0   0
table(scored$clust.km, scored$clust.hc)
   
      1   2   3   4   5
  1 305 324   9   1   0
  2   5 368  62 223   0
  3  31  35 268 114   9
  4   0   6   0 380  13
  5   0   0   3  49 182

So the two algorithms produce similar – but far from equivalent – results.
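To put a number on the agreement, one common measure is the Rand index: the fraction of point pairs that the two clusterings treat the same way (either grouped together in both, or separated in both). A minimal hand-rolled version (our own helper, not from a package):

```r
# Rand index: share of unordered pairs on which two partitions agree
rand.index <- function(a, b) {
  same.a <- outer(a, a, "==")     # pair grouped together in partition a?
  same.b <- outer(b, b, "==")
  agree  <- same.a == same.b      # both together, or both apart
  mean(agree[upper.tri(agree)])   # each pair counted once
}

rand.index(c(1, 1, 2, 2), c(2, 2, 1, 1))   # identical up to relabeling: 1
```

In our session, rand.index(clust.km, clust.hc1) would quantify the overlap shown in the table above; values near 1 indicate strong agreement.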

PCA

Let’s perform a PCA:

pc1 <- prcomp(dat)           # dat is already centered and scaled,
pc1 <- prcomp(scale(dat))    # so rescaling here makes no difference
round(pc1$rotation[,1:2], 3)
                             PC1    PC2
YPLL.Rate                 -0.301  0.042
Physically.Unhealthy.Days -0.235  0.189
Mentally.Unhealthy.Days   -0.190  0.122
Perc.Low.Birth.Weight     -0.248 -0.158
Perc.Obese                -0.222  0.126
Perc.Physically.Inactive  -0.249  0.238
Chlamydia.Rate            -0.179 -0.407
Teen.Birth.Rate           -0.288 -0.051
Perc.Uninsured            -0.206 -0.053
Dentist.Ratio             -0.128  0.293
Prev.Hosp.Stay.Rate       -0.216  0.273
Perc.Diabetic.Screening    0.136  0.064
Perc.Mammography           0.194 -0.160
Perc.High.School.Grad      0.165  0.295
Perc.Some.College          0.262 -0.195
Perc.Unemployed           -0.193 -0.084
Perc.Children.in.Poverty  -0.303 -0.073
Perc.Single.Parent.HH     -0.246 -0.310
Violent.Crime.Rate        -0.148 -0.407
Avg.Daily.Particulates    -0.115  0.128
Perc.pop.in.viol          -0.039  0.087
Rec.Facility.Rate          0.160 -0.175
Perc.Limited.Access       -0.073 -0.167
Perc.Fast.Foods           -0.128 -0.084

Let’s compute the PCs and check their correlation:

pcs <- predict(pc1) 
describe(pcs)[,1:5]
     vars    n mean   sd median
PC1     1 2387    0 2.87   0.18
PC2     2 2387    0 1.57   0.12
PC3     3 2387    0 1.39   0.27
PC4     4 2387    0 1.12   0.06
PC5     5 2387    0 1.09  -0.01
PC6     6 2387    0 0.97  -0.13
PC7     7 2387    0 0.91  -0.02
PC8     8 2387    0 0.87  -0.03
PC9     9 2387    0 0.85  -0.01
PC10   10 2387    0 0.84  -0.02
PC11   11 2387    0 0.79  -0.05
PC12   12 2387    0 0.76   0.03
PC13   13 2387    0 0.71  -0.01
PC14   14 2387    0 0.68   0.03
PC15   15 2387    0 0.67   0.01
PC16   16 2387    0 0.65   0.00
PC17   17 2387    0 0.58  -0.01
PC18   18 2387    0 0.55   0.00
PC19   19 2387    0 0.54  -0.01
PC20   20 2387    0 0.50   0.00
PC21   21 2387    0 0.48  -0.01
PC22   22 2387    0 0.45   0.02
PC23   23 2387    0 0.43   0.01
PC24   24 2387    0 0.39   0.00
dim(pcs); dim(dat)
[1] 2387   24
[1] 2387   24
corrplot(cor(pcs))  

So as expected, the PCs are uncorrelated.
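This is not a coincidence: prcomp scores are orthogonal by construction, so their sample correlations vanish up to floating-point error. A quick check on toy data (names are ours):

```r
set.seed(1)
toy <- matrix(rnorm(100 * 5), ncol = 5)
toy.pcs <- predict(prcomp(toy))   # PC scores
R <- cor(toy.pcs)
max(abs(R[upper.tri(R)]))         # numerically zero
```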

Let’s prepare a scree plot:

vars <- apply(pcs, 2, var)
sum(vars); ncol(dat); ncol(pcs)
[1] 24
[1] 24
[1] 24
barplot(vars[1:10], col="lightblue", ylab="variance", las=1)
title("Principal Components Analysis Scree Plot", col.main="navy")
abline(h=1:7, col="darkcyan")
abline(h=0)

plot(pc1)    

summary(pc1)
Importance of components:
                          PC1    PC2     PC3    PC4     PC5     PC6     PC7
Standard deviation     2.8682 1.5704 1.39358 1.1225 1.08510 0.96611 0.91362
Proportion of Variance 0.3428 0.1027 0.08092 0.0525 0.04906 0.03889 0.03478
Cumulative Proportion  0.3428 0.4455 0.52643 0.5789 0.62799 0.66688 0.70166
                           PC8    PC9    PC10    PC11    PC12    PC13   PC14
Standard deviation     0.87179 0.8471 0.83583 0.79341 0.76144 0.70866 0.6752
Proportion of Variance 0.03167 0.0299 0.02911 0.02623 0.02416 0.02092 0.0190
Cumulative Proportion  0.73333 0.7632 0.79233 0.81856 0.84272 0.86365 0.8826
                          PC15    PC16    PC17    PC18    PC19    PC20    PC21
Standard deviation     0.66962 0.64511 0.58329 0.55006 0.53940 0.50234 0.47829
Proportion of Variance 0.01868 0.01734 0.01418 0.01261 0.01212 0.01051 0.00953
Cumulative Proportion  0.90132 0.91866 0.93284 0.94545 0.95757 0.96809 0.97762
                          PC22    PC23    PC24
Standard deviation     0.44520 0.43295 0.38929
Proportion of Variance 0.00826 0.00781 0.00631
Cumulative Proportion  0.98588 0.99369 1.00000

So the first PC seems to explain a lot of the variation. We can illustrate this via a so-called biplot:

biplot(pc1, col=c("slategrey", "navy"), cex=c(.2, .8))

round(pc1$rotation, 4)[,1:2]
                              PC1     PC2
YPLL.Rate                 -0.3011  0.0418
Physically.Unhealthy.Days -0.2354  0.1891
Mentally.Unhealthy.Days   -0.1896  0.1225
Perc.Low.Birth.Weight     -0.2480 -0.1581
Perc.Obese                -0.2217  0.1260
Perc.Physically.Inactive  -0.2486  0.2385
Chlamydia.Rate            -0.1792 -0.4068
Teen.Birth.Rate           -0.2879 -0.0511
Perc.Uninsured            -0.2056 -0.0526
Dentist.Ratio             -0.1277  0.2932
Prev.Hosp.Stay.Rate       -0.2163  0.2732
Perc.Diabetic.Screening    0.1359  0.0640
Perc.Mammography           0.1936 -0.1602
Perc.High.School.Grad      0.1648  0.2953
Perc.Some.College          0.2622 -0.1952
Perc.Unemployed           -0.1933 -0.0842
Perc.Children.in.Poverty  -0.3026 -0.0726
Perc.Single.Parent.HH     -0.2457 -0.3103
Violent.Crime.Rate        -0.1479 -0.4066
Avg.Daily.Particulates    -0.1146  0.1276
Perc.pop.in.viol          -0.0391  0.0866
Rec.Facility.Rate          0.1605 -0.1752
Perc.Limited.Access       -0.0726 -0.1669
Perc.Fast.Foods           -0.1279 -0.0835

We can visualize our clusters in PC space:

col <- c("blue","dodgerblue","lightgreen","pink","red")

par(mfrow=c(1,2))

clust <- scored$clust.km
plot(pcs, type="n", main="k-means")
text(pcs, labels=clust, col=col[clust])
abline(h=0); abline(v=0)


clust <- scored$clust.hc
plot(pcs, type="n", main="hierarchical clustering")
text(pcs, labels=clust, col=col[clust])
abline(h=0); abline(v=0)

What do you think?

Analyzing the Clusters

Let’s consider aggregate statistics of each variable by cluster:

col <- c("blue", "dodgerblue", "lightgreen", "pink", "red", "maroon", "darkorange")

clust <- scored[, "clust.km"]
agg <- function(x) tapply(x, clust, mean)
summ <- apply(dat, 2, agg)
t(round(summ,2))
                              1     2     3     4     5
YPLL.Rate                 -0.98 -0.16 -0.01  1.06  1.34
Physically.Unhealthy.Days -0.84 -0.09 -0.04  1.28  0.42
Mentally.Unhealthy.Days   -0.71 -0.06  0.03  1.04  0.27
Perc.Low.Birth.Weight     -0.74 -0.24  0.09  0.43  1.80
Perc.Obese                -0.73  0.13 -0.33  0.59  1.26
Perc.Physically.Inactive  -0.84  0.16 -0.43  1.03  0.94
Chlamydia.Rate            -0.47 -0.41  0.48 -0.24  1.91
Teen.Birth.Rate           -1.02 -0.22  0.34  0.90  1.21
Perc.Uninsured            -0.88 -0.12  0.53  0.63  0.63
Dentist.Ratio             -0.47  0.26 -0.47  0.63  0.41
Prev.Hosp.Stay.Rate       -0.64 -0.01 -0.41  1.18  0.55
Perc.Diabetic.Screening    0.43  0.12 -0.12 -0.35 -0.68
Perc.Mammography           0.67 -0.01  0.07 -0.86 -0.47
Perc.High.School.Grad      0.56  0.37 -0.63 -0.11 -1.16
Perc.Some.College          1.04 -0.17  0.14 -1.02 -0.88
Perc.Unemployed           -0.74 -0.08  0.32  0.35  1.04
Perc.Children.in.Poverty  -1.03 -0.21  0.29  0.81  1.48
Perc.Single.Parent.HH     -0.78 -0.31  0.47  0.13  1.86
Violent.Crime.Rate        -0.48 -0.38  0.69 -0.19  1.34
Avg.Daily.Particulates    -0.51  0.22 -0.22  0.47  0.39
Perc.pop.in.viol          -0.09 -0.04 -0.08  0.32 -0.01
Rec.Facility.Rate          0.63 -0.15  0.11 -0.59 -0.50
Perc.Limited.Access       -0.21 -0.25  0.48 -0.07  0.47
Perc.Fast.Foods           -0.39 -0.18  0.13  0.39  0.65

And let’s include the PCs:

scored2 <- dat
scored2$PC1 <- pcs[,1]
scored2$PC2 <- pcs[,2]
scored2$PC3 <- pcs[,3]
clust <- scored[, "clust.km"]
agg <- function(x) tapply(x, clust, mean)
SUMMARY <- apply(scored2, 2, agg)
t(round(SUMMARY,2))
                              1     2     3     4     5
YPLL.Rate                 -0.98 -0.16 -0.01  1.06  1.34
Physically.Unhealthy.Days -0.84 -0.09 -0.04  1.28  0.42
Mentally.Unhealthy.Days   -0.71 -0.06  0.03  1.04  0.27
Perc.Low.Birth.Weight     -0.74 -0.24  0.09  0.43  1.80
Perc.Obese                -0.73  0.13 -0.33  0.59  1.26
Perc.Physically.Inactive  -0.84  0.16 -0.43  1.03  0.94
Chlamydia.Rate            -0.47 -0.41  0.48 -0.24  1.91
Teen.Birth.Rate           -1.02 -0.22  0.34  0.90  1.21
Perc.Uninsured            -0.88 -0.12  0.53  0.63  0.63
Dentist.Ratio             -0.47  0.26 -0.47  0.63  0.41
Prev.Hosp.Stay.Rate       -0.64 -0.01 -0.41  1.18  0.55
Perc.Diabetic.Screening    0.43  0.12 -0.12 -0.35 -0.68
Perc.Mammography           0.67 -0.01  0.07 -0.86 -0.47
Perc.High.School.Grad      0.56  0.37 -0.63 -0.11 -1.16
Perc.Some.College          1.04 -0.17  0.14 -1.02 -0.88
Perc.Unemployed           -0.74 -0.08  0.32  0.35  1.04
Perc.Children.in.Poverty  -1.03 -0.21  0.29  0.81  1.48
Perc.Single.Parent.HH     -0.78 -0.31  0.47  0.13  1.86
Violent.Crime.Rate        -0.48 -0.38  0.69 -0.19  1.34
Avg.Daily.Particulates    -0.51  0.22 -0.22  0.47  0.39
Perc.pop.in.viol          -0.09 -0.04 -0.08  0.32 -0.01
Rec.Facility.Rate          0.63 -0.15  0.11 -0.59 -0.50
Perc.Limited.Access       -0.21 -0.25  0.48 -0.07  0.47
Perc.Fast.Foods           -0.39 -0.18  0.13  0.39  0.65
PC1                        3.44  0.44 -0.42 -3.02 -4.65
PC2                       -0.14  0.85 -1.51  1.59 -1.79
PC3                        0.02  0.14 -0.37 -0.11  0.44

Let’s visualize:

names(dat)
 [1] "YPLL.Rate"                 "Physically.Unhealthy.Days"
 [3] "Mentally.Unhealthy.Days"   "Perc.Low.Birth.Weight"    
 [5] "Perc.Obese"                "Perc.Physically.Inactive" 
 [7] "Chlamydia.Rate"            "Teen.Birth.Rate"          
 [9] "Perc.Uninsured"            "Dentist.Ratio"            
[11] "Prev.Hosp.Stay.Rate"       "Perc.Diabetic.Screening"  
[13] "Perc.Mammography"          "Perc.High.School.Grad"    
[15] "Perc.Some.College"         "Perc.Unemployed"          
[17] "Perc.Children.in.Poverty"  "Perc.Single.Parent.HH"    
[19] "Violent.Crime.Rate"        "Avg.Daily.Particulates"   
[21] "Perc.pop.in.viol"          "Rec.Facility.Rate"        
[23] "Perc.Limited.Access"       "Perc.Fast.Foods"          
show <- 1:8     # run the loop below with each selection in turn
show <- 9:16
show <- 17:22
#show <- 26:30
par(mfrow=c(2,4))
for(i in show){
    boxplot( scored[,i] ~ clust, col=col, varwidth=TRUE)
    abline(h=0, col="navy")
    title(names(scored)[i])
}

So there is quite some variation. To make sense of all this, let’s visualize the clusters on a map, where we align our data with the map definitions by matching FIPS codes (using the county.fips dataset that ships with the maps package):

library(maps)
library(mapproj)

par(mfrow=c(1,1))
data(county.fips)   # FIPS code to county name lookup table
head(county.fips)
  fips        polyname
1 1001 alabama,autauga
2 1003 alabama,baldwin
3 1005 alabama,barbour
4 1007    alabama,bibb
5 1009  alabama,blount
6 1011 alabama,bullock
matches <- clust[match(county.fips$fips, ID$FIPS)]
map("county", col = col[matches], fill = TRUE, resolution = 0,
    lty = 0, projection = "polyconic")
map("state", col = "white", fill = FALSE, add = TRUE, lty = 1, lwd = 0.2,
    projection="polyconic")
title("County Cluster Assignments Based on Population Health Data", col.main="navy")
txt <- as.character(1:length(unique(clust)))
legend("bottomleft", txt, horiz=TRUE, fill=col)